Detecting Near-Duplicates in Large-Scale Short Text Databases

Authors

  • Caichun Gong
  • Yulan Huang
  • Xueqi Cheng
  • Shuo Bai
Abstract

Near-duplicates are abundant in short text databases, and detecting and eliminating them is of great importance. SimFinder, proposed in this paper, is a fast algorithm that identifies all near-duplicates in large-scale short text databases. An ad hoc term weighting scheme is employed to measure each term's discriminative ability. A certain number of terms are extracted to form a feature list for each short text. SimFinder generates several fingerprints for each feature list, and only texts with the same fingerprint are compared with each other. An optimization procedure is employed in SimFinder to make it more efficient. Experiments indicate that SimFinder is an effective solution for short text duplicate detection with almost linear time and storage complexity. Both precision and recall of SimFinder are promising.
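
The pipeline outlined in the abstract (weight terms, keep a feature list, derive several fingerprints, compare only texts that share a fingerprint) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme, the fingerprint construction, or the optimization procedure. The sketch assumes inverse document frequency as a stand-in weight, a hypothetical leave-one-term-out fingerprint, and a Jaccard similarity check with an arbitrary threshold; all function names and parameters are illustrative.

```python
import hashlib
import math
from collections import Counter, defaultdict
from itertools import combinations


def idf_weights(docs):
    """Stand-in term weighting (assumption): inverse document frequency."""
    df = Counter(term for doc in docs for term in set(doc.split()))
    n = len(docs)
    return {term: math.log(n / count) for term, count in df.items()}


def feature_list(doc, weights, k=4):
    """Keep the k highest-weighted distinct terms, in a stable sorted order."""
    terms = set(doc.split())
    top = sorted(terms, key=lambda t: (-weights[t], t))[:k]
    return tuple(sorted(top))


def fingerprints(features):
    """Hypothetical scheme: hash the feature list with one term left out each time."""
    if len(features) < 2:
        variants = [features]
    else:
        variants = [features[:i] + features[i + 1:] for i in range(len(features))]
    return {
        hashlib.blake2b(" ".join(v).encode(), digest_size=8).hexdigest()
        for v in variants
    }


def jaccard(a, b):
    """Jaccard similarity of two term collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def near_duplicates(docs, k=4, threshold=0.6):
    """Return index pairs of texts that share a fingerprint and pass the check."""
    weights = idf_weights(docs)
    buckets = defaultdict(set)
    for idx, doc in enumerate(docs):
        for fp in fingerprints(feature_list(doc, weights, k)):
            buckets[fp].add(idx)
    pairs = set()
    # Only texts that landed in the same fingerprint bucket are compared.
    for members in buckets.values():
        for i, j in combinations(sorted(members), 2):
            if jaccard(docs[i].split(), docs[j].split()) >= threshold:
                pairs.add((i, j))
    return sorted(pairs)
```

Because candidate pairs come only from shared buckets, the number of pairwise comparisons depends on bucket sizes rather than on the square of the database size, which is the general idea behind the near-linear behaviour the abstract reports. Generating several fingerprints per text lets two texts that differ in one feature term still meet in a bucket.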

Similar resources

Web-Scale Near-Duplicate Search: Techniques and Applications

As the bandwidth accessible to average users has increased, audiovisual material has become the fastest growing datatype on the Internet. The impressive growth of the social Web, where users can exchange user-generated content, contributes to the overwhelming number of multimedia files available. Among these huge volumes of data, a large number of near duplicates and copies exist. File copies...

SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages

Detecting near-duplicates within a huge repository of short messages is known to be a challenge because of the messages' short length, the frequent typos made when typing on a mobile phone, the flexibility and diversity of the Chinese language, and the target we prefer, near-duplicates. In this paper, we discuss the real problem met in a real application, and try to look for a suitable technique to solve this proble...

Detecting Near Duplicates in Software Documentation

Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of “near duplicate” fragments, i.e. chunks of text that were copied from a single source and were later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to d...

Detecting duplicates among symbolically compressed images in a large document database

The detection of duplicate images is a useful means of indexing a large database of documents. An algorithm for duplicate document detection is proposed in this paper that operates directly on images that have been symbolically compressed using techniques related to the ongoing JBIG2 standardization effort. This paper describes a hidden Markov model (HMM) method that recognizes the text in an im...

An image signature for any kind of image

We describe an algorithm for computing an image signature, suitable for first-stage screening for duplicate images. Our signature relies on relative brightness of image regions, and is generally applicable to photographs, text documents, and line art. We give experimental results on the sensitivity and robustness of signatures for actual image collections, and also results on the robustness of ...

Publication date: 2008